One of the fundamental ideas behind regression and classification models is that, for the most part, observations with similar inputs will be assigned similar outputs. The k-Nearest Neighbours algorithm takes this idea to the extreme: in the regression case, the prediction for a new observation is simply the average of the targets of the k training observations closest to it.
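Concretely, if $N_k(x)$ denotes the set of the $k$ training points nearest to a new point $x$, the k-NN regression prediction can be written as

$$\hat{y}(x) = \frac{1}{k}\sum_{i \in N_k(x)} y_i,$$

i.e. an unweighted average of the neighbours' targets (distance-weighted variants exist, but the implementation below uses the plain average).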
#Import modules
import numpy as np
import random
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
n = 200 #Number of observations
X = np.random.uniform(0,5,n)
Y = np.array([x*np.sin(x**2) for x in X]) + np.random.normal(0,1,n)
data = pd.DataFrame({'X':X, 'Y':Y})
plt.figure(figsize = (10,6))
plt.scatter(X,Y)
plt.show()
class kNN:

    def __init__(self, data, target, features, trainTestRatio = 0.9):
        self.target = target
        self.features = features

        #Split up data into a training and testing set
        self.train, self.test = train_test_split(data, test_size=1-trainTestRatio)

    #There's no fitting process for k-NN, per se. To make predictions we simply examine the training set
    #and find the closest examples we can

    def predictSingleExample(self, x, k = 5):
        #For a given example x, we find the k examples in the training set whose features are closest
        #in terms of Euclidean distance

        trainingData = self.train.copy()

        #Compute distance from x to each element of the training set
        trainingData['distance'] = [np.linalg.norm(np.array(x[self.features]) - np.array(row[self.features])) for idx, row in trainingData.iterrows()]

        #Sort by distance and then take the first k
        closestk = trainingData.sort_values('distance', ascending = True).head(k)

        #Now return the mean of those closest k
        return closestk[self.target].mean()

    def predict(self, X = None, k = 5):
        #Predict the values for a dataframe of values (in the same format as the training set)
        #Returns a list of the predicted values

        #If no X provided, predict values of the test set
        if X is None:
            X = self.test

        return [self.predictSingleExample(x, k) for idx, x in X.iterrows()]
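As an aside, the row-by-row distance loop in predictSingleExample is the simplest thing that works, but iterating over a DataFrame is slow. A vectorised alternative (a sketch of my own, not part of the class above; euclideanDistances is a hypothetical helper name) would compute all distances in a single NumPy call:

#Sketch: vectorised alternative to the row-by-row distance loop (hypothetical helper, not used above)
def euclideanDistances(trainFrame, features, x):
    #Broadcast x against every training row and take the norm along the feature axis
    diffs = trainFrame[features].to_numpy(dtype=float) - np.asarray(x[features], dtype=float)
    return np.linalg.norm(diffs, axis=1)

Inside predictSingleExample, the list comprehension could then be replaced with trainingData['distance'] = euclideanDistances(trainingData, self.features, x).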
mykNN = kNN(data, 'Y', ['X'])
mykNN.test['Pred'] = mykNN.predict(k = 5)
Let's plot the results.
f = plt.figure(figsize=(15,6))
ax = f.add_subplot(121)
ax2 = f.add_subplot(122)
ax.scatter(mykNN.test['X'], mykNN.test['Y'] - mykNN.test['Pred'])
ax.set_xlabel('X')
ax.set_ylabel('Y - Pred')
ax.set_xlim([0, 5])
ax.set_ylim([-5,5])
ax2.scatter(mykNN.test['Y'], mykNN.test['Pred'], label = 'True values vs Predicted Values')
ax2.plot(np.arange(-5,5,0.1), np.arange(-5,5,0.1), color = 'green', label = 'Line y = x')
ax2.set_xlim([-5, 5])
ax2.set_ylim([-5,5])
ax2.set_xlabel('True Label')
ax2.set_ylabel('Predicted Label')
ax2.legend()
plt.show()
The left plot shows the residuals plotted against our input feature, X. We can see that overall, the residuals are scattered about 0, indicating that in general the model did a decent job of capturing the relationship between the feature and target. The residuals exhibit slightly more variance when $x > 3$, but this is not particularly surprising as the target oscillates for large $x$.
If we examine the plot on the right, we can see that the scatter plot of the true vs predicted values generally adheres to the line $y = x$, indicating the model is performing as we would want.
For a conceptually simple algorithm, k-nearest neighbours has worked surprisingly well on this example - far better, for example, than we would expect linear regression to, given the highly non-linear feature-target relationship.
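To attach a rough number to this impression (an illustrative addition, not part of the original analysis), we could compute the root-mean-squared error of the test-set predictions made above:

#Sketch: quantify the test-set fit with RMSE (illustrative addition)
rmse = np.sqrt(np.mean((mykNN.test['Y'] - mykNN.test['Pred'])**2))
print(f'Test RMSE: {rmse:.3f}')

Since the noise added to the data has standard deviation 1, an RMSE not far above 1 would be consistent with the residual plot.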
Whilst k-nearest neighbours has performed well on a small dataset such as this one, it tends to scale poorly to larger datasets: larger datasets demand more memory to store, and finding the nearest neighbours of a new example takes longer. Additionally, k-nearest neighbours suffers from the 'Curse of Dimensionality', meaning that its performance tends to degrade substantially as the number of features increases.
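One common way to mitigate the lookup cost (not used in the implementation above) is a spatial index such as a k-d tree, which avoids computing the distance to every training point for every query. A rough sketch using scikit-learn's KDTree on our one-dimensional feature:

#Sketch: neighbour search via a k-d tree instead of brute force (not part of the kNN class above)
from sklearn.neighbors import KDTree

trainX = mykNN.train[['X']].to_numpy()
trainY = mykNN.train['Y'].to_numpy()
tree = KDTree(trainX)                                        #Build the index once
dist, idx = tree.query(mykNN.test[['X']].to_numpy(), k = 5)  #Find the 5 nearest neighbours of each test point
treePreds = trainY[idx].mean(axis = 1)                       #Average their targets, as before

Note that tree-based indices help most in low dimensions; they do not remove the curse of dimensionality itself.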
k is a hyperparameter we can tune, and it should be thought of as a regulariser. The larger the value of k, the more training examples contribute to each prediction, and therefore the more robust the model is to slight perturbations in the training data.
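One rough way to see this: if the targets of the $k$ neighbours of a point $x$ had roughly the same mean and independent noise of variance $\sigma^2$, then the prediction, being an average of $k$ of them, would have variance of about

$$\mathrm{Var}\big(\hat{y}(x)\big) \approx \frac{\sigma^2}{k},$$

so larger $k$ averages away more noise, at the price of averaging over a wider neighbourhood and potentially smoothing out genuine structure (higher bias).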
Let's plot the regression line for a variety of values of k.
plt.figure(figsize = (10,6))
dataRegLine = pd.DataFrame({'X':np.linspace(0,5,100)})
for k in [1, 5, 10, 30, 100, 180]:
    dataRegLine[f'Pred{k}'] = mykNN.predict(X = dataRegLine[['X']], k = k) #Obtain predictions
    plt.plot(dataRegLine['X'], dataRegLine[f'Pred{k}'], label = f'{k}-Nearest Neighbours') #Plot regression line
plt.legend()
plt.scatter(mykNN.train['X'], mykNN.train['Y'])
plt.show()
The plot above shows the effect of changing the value of k. Setting k = 1 yields the most non-linear regression line, but has the disadvantage that it essentially just connects the training points. Conversely, setting k = 180 (the size of the training set, since 10% of the 200 observations are held out for testing) returns the same value for all inputs, as we take the average of the whole training set each time we make a prediction. Intermediate values of k exhibit varying degrees of non-linearity, in line with what we might expect.
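If we wanted to choose k rather than judge it by eye, one simple approach (an illustrative sketch, not part of the original; strictly we should tune on a separate validation set or via cross-validation rather than the test set) is to compare the error over a grid of candidate values:

#Sketch: compare test-set RMSE over a grid of k values (illustrative only; a proper tuning
#procedure would use a validation set or cross-validation rather than the test set)
candidateKs = [1, 3, 5, 10, 20, 30, 50]
rmseByK = {}
for k in candidateKs:
    preds = np.array(mykNN.predict(k = k))   #Predictions for the held-out test set
    rmseByK[k] = np.sqrt(np.mean((mykNN.test['Y'].to_numpy() - preds)**2))
bestK = min(rmseByK, key = rmseByK.get)
print(f'k with lowest test RMSE: {bestK}')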